A program executing on a first processor in an MP configuration, awaiting the release of a resource held by another processor, detects the expiration of a fixed time interval and initiates a hierarchy of recovery actions designed to cause the resource to be freed. These actions, targeted at a processor believed to be the one currently holding the resource, are taken only if that processor is not executing an "exempt" routine. The actions, taken in order of increasing severity, are: wait for a second fixed time interval; terminate the routine on the resource-holding processor, allowing retry; terminate the routine without allowing retry; invoke Alternate CP Recovery. The hierarchy is escalated against the target processor until that processor releases the resource, and against other processors in the configuration until the resource is acquired by the first processor. These actions may proceed in parallel for multiple detecting and target processors within an MP environment.
SYSTEMATIC RECOVERY OF EXCESSIVE SPIN LOOPS IN AN N-WAY MP ENVIRONMENT



[...] More specifically, it relates to mechanisms for detecting and recovering from spin loop situations in multiprocessor system configurations.

A spin loop is a condition which occurs in a multiprocessor (MP) system when a routine executing on one Central Processor (CP) is unable to complete a function due to a dependence on some action being taken on another CP. If the function must be completed before further processing can be performed, the routine may enter a loop and spin, waiting for the required action to be taken on the other CP.

Spin loops typically occur in systems such as MVS/XA and MVS/ESA when a system routine is attempting to perform one of the following functions:

1. Communicate with other CPs. For example, when an MVS system routine running on one CP determines that an address space should be swapped out of main storage, it is necessary to notify all other CPs to purge their translation lookaside buffers of addresses related to that address space. This is accomplished by issuing a SIGP (Signal Processor) Emergency Signal to the other CPs. Until each CP responds with an indication that it has performed the required purge, the initiating MVS routine will enter a spin loop to await completion of the required action.

2. Serialization of function across all CPs. MVS uses system locks to serialize execution of many functions across all of the CPs in the system. This is necessary to ensure the integrity of the operation being performed. The general locking architecture used in the MVS system is described in the IBM Technical Disclosure Bulletin, Dec. 1973, Volume 16, No. 7, at page 2420. As an example, if an MVS routine on one CP wishes to process the results of an I/O interrupt from a device, it must ensure that status about the interrupt is not inadvertently corrupted by a system routine on another CP wishing to initiate a new I/O operation to the device. This is accomplished via the use of a system lock per device. If a system routine requires the lock for a given device which is owned by a routine on another CP, it will enter a spin loop until the lock becomes available.

Spin loops are a normal phenomenon of an MP system. They are almost always extremely brief and non-disruptive to the operating environment. However, when their duration becomes excessive, spin loops become a problem which requires recovery action to resolve. In the prior art, those actions were determined and performed by the system operator.

Excessive spin loop (ESL) conditions can be triggered by a wide variety of causes. For example, the CP which is holding a resource required by a routine spinning on another CP may be:

- Experiencing a hardware failure
- Experiencing a software failure
- Performing a critical function which takes an unusually long period of time to complete
- Stopped by the operator or by the operating system

In the past, the MVS operating system detected the existence of an ESL and surfaced the condition to the system operator. The detection was performed by the routine in the spin loop, after spinning for a full ESL timeout interval, which was approximately 40 seconds in MVS. It then invoked the Excessive Spin Notification Routine to issue a message to the operator requesting recovery action.

Determination of the correct recovery action to resolve an ESL condition is complex, error-prone, and especially critical given the severe impact such a condition has on the operating system. Due to the frequency of inter-processor communication and cross-CP resource serialization in an MP environment, when one CP fails, all other CPs very quickly enter spin loops until the problem on the failing CP is resolved.

According to the prior art, there were three recovery actions that an operator could take when an ESL occurred. Each has benefits and drawbacks associated with it. The actions are as follows:

1. Respond to the ESL message to continue to spin on the detecting CP for another excessive spin loop interval.

This will only have benefit if the cause of the spin loop is temporary, i.e., if it is due to some unusually lengthy but legitimate processing on the CP causing the condition.

The problem here is that neither the operator nor MVS knows whether the condition is temporary or not. If the operator does not respond to continue the spin and instead performs a recovery action, the possibility exists that an important MVS system function will be the target of that destructive recovery action. This may even result in an unnecessary system crash.

On the other hand, if the operator does decide to continue the spin, how many times should the spin be allowed to repeat before taking a more forceful action? Each response to continue in the spin loop further prolongs the time that the system is unavailable.

2. Respond to the ESL message to trigger the MVS Alternate CP Recovery (ACR) function for the failing CP. The general ACR function is described in IBM Technical Disclosure Bulletin, Nov. 1973, Volume 16, No. 6, at page 2005. The algorithm used to determine which is the failing CP in an N-way environment is described in IBM Technical Disclosure Bulletin, July 1983, Volume 26, No. 2, at page 748.

This causes the recovery routines protecting the program running on the failing CP to be invoked. This is done to allow the recovery routines to release resources held on the failing CP which may be required by the CP currently in a spin loop.

The drawback of this action is that it also results in removing the "failing" CP from use by the MVS operating system. Experience has shown that excessive spin loops are usually caused by non-CP related hardware or software errors. The recovery processing associated with ACR may resolve the spin loop, but removing the CP from the configuration is highly disruptive and also unnecessary in the majority of spin loop scenarios.

Even with a highly skilled operator, who determines and performs each recovery action after only 30 seconds delay, the system is completely unavailable for several minutes. In addition, the CP is unnecessarily removed from system use for an undetermined period of time.

Another drawback of the ACR action can be that recovery routines are allowed to retry after being invoked. Therefore, the ability of the ACR action to resolve the spin loop and avoid a system outage is highly dependent on the effectiveness of the recovery routines protecting the failing program. If the recovery routines do not release the resources required by the CP in the spin loop, or retry back to a point in the failing program which caused the problem to begin with, the spin loop condition will not be resolved.

3. Respond to the ESL message to continue the spin on the detecting CP AND initiate a RESTART from the system console to interrupt the routine executing on the failing CP. This action will trigger invocation of recovery routines to force the release of resources held on the failing CP.

The drawback of this action is that it results in termination of the current unit of work because recovery routines are not allowed to retry when RESTART is invoked. Thus, even though the recovery routines may be able to successfully resolve the problem causing the spin loop, the program is forced to terminate. If a critical job or subsystem is active on the failing CP when the spin loop is detected, invocation of RESTART will cause loss of that critical subsystem and perhaps require re-IPL of the system.

Another drawback is that the RESTART procedure is more complicated than simply responding to a message and is therefore more prone to error.

Most ESL conditions, due to operator [...] inadequate recovery actions, end with a system crash and an extended outage [...].

In addition to the complexities of the [...] decisions required by the operator to recover from an ESL condition, the mechanics of effecting the recovery become significantly more involved if the operator is unable to answer the ESL message and instead must respond to the [...] restartable wait state. For example, for an ACR response, the operating procedure involves:

1. Stopping all CPs in the system.

2. Storing the ACR response in main storage on the detecting CP (which may be [...] on the installation's policies).

3. Starting all the CPs except the detecting and failing CPs.

4. Restarting the detecting CP to initiate recovery.



SUMMARY OF THE INVENTION 

The present invention is a system and process, in a multiprocessor system environment, for detecting and taking steps to automatically recover from excessive spin loop conditions. It comprises functions and supporting indicators that clearly identify true spin loop situations, and present a hierarchical series of recovery actions, some new to the ESL environment, that minimize the impact of the condition on the multiprocessor system and its workload.

It is an object of the present invention to provide an automatic and efficient mechanism for detecting and recovering from excessive spin loop situations in an MP environment.

It is a further object of this invention to recognize persistent, related spin loop situations in an MP environment, and recover automatically from them. This includes recovering in parallel from multiple ESL occurrences involving more than one failing CP.

It is a further object of this invention to present a hierarchy of recovery actions representing progressively more severe actions, so that a severe action is taken only when a less severe action has failed to resolve the problem.



BRIEF DESCRIPTION OF THE DRAWINGS 

Fig. 1 is a linear time flow diagram showing an overview of the Excessive Spin Loop Recovery (ESLR) Function operating in a 2-way MP environment.

Fig. 2 [...] Excessive Spin Loop Recovery processing.

Fig. 3 is a function flow diagram showing the hierarchy of recovery actions taken within ESLR.

Fig. 4 is a linear time flow diagram showing a scenario in which ESLR processing is used to resolve a spin loop deadlock situation in a 6-way MP environment.
Figure 1 shows an environment in which an embodiment of the present invention operates. It illustrates a 2-way MP system consisting of Central Processor 1 (10) and Central Processor 2 (11). Central Processor 1, having obtained spin-type lock X at time t0 (101), subsequently enters a disabled loop (102); Central Processor 2, requesting spin lock X at time t0 + 1 (110), is unable to obtain it, and so "spins", periodically re-requesting the lock (111).

As with systems of the prior art, it is the responsibility of the processes which have requested a spin-type lock to determine that a "long" time has elapsed since the lock was requested (a time interval referred to as the ESL, or Excessive Spin Loop, interval); having recognized that this period of time has elapsed (112), the requesting processor invokes the Excessive Spin Loop Recovery (ESLR) processing of this invention (113). This processing ultimately results in the release of the lock by processor 1 (103), and allows the subsequent acquisition of the lock by processor 2 (114).
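
The detection step described above can be sketched in Python. This is only an illustration under stated assumptions, not the MVS implementation: the names spin_for_lock and on_excessive_spin and the injectable clock parameter are hypothetical, and a threading.Lock polled without blocking stands in for a spin-type lock.

```python
import threading
import time


def spin_for_lock(lock, esl_interval, on_excessive_spin, clock=time.monotonic):
    """Spin until `lock` is obtained; report every elapsed ESL interval.

    `on_excessive_spin` is a hypothetical stand-in for the call into ESLR.
    Returns the number of ESL detections made while spinning.
    """
    detections = 0
    deadline = clock() + esl_interval
    while not lock.acquire(blocking=False):      # periodically re-request the lock
        if clock() >= deadline:                  # a full ESL interval has elapsed
            detections += 1
            on_excessive_spin(detections)        # invoke recovery processing
            deadline = clock() + esl_interval    # then spin for another interval
    return detections
```

The spinner itself owns the timeout, which matches the text: the routine waiting for the lock, not the holder, decides that the wait has become excessive.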

Referring to figure 2, excessive spin loop recovery processing is entered when the CP requesting the lock detects that it has been waiting for the lock for an excessive amount of time. On entry, this routine checks to determine whether excessive spin loop recovery processing is active on any other CP in the complex by checking the CVT global control block (24) via the atomic "Test and Set" operation. If the answer is yes, there is an immediate return and this indication is not treated as a detection of an excessive spin loop.
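
A minimal sketch of this entry check follows, in Python. Python has no Test and Set instruction, so a non-blocking Lock.acquire, which atomically tests and sets the lock state, is used as a stand-in; the class name EslrGate and its methods are hypothetical.

```python
import threading


class EslrGate:
    """Serializes recovery processing: at most one CP runs it at a time."""

    def __init__(self):
        self._active = threading.Lock()   # stand-in for the indicator in the CVT

    def try_enter(self):
        # Atomically test the indicator and set it if it was clear.
        # False means recovery is already active elsewhere: return at once.
        return self._active.acquire(blocking=False)

    def leave(self):
        # Clear the indicator on exit from recovery processing.
        self._active.release()
```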

If the answer is no, the failing CP is identified as indicated in the aforementioned TDB (Vol. 26, No. 2, at p. 748), and the identity of the failing CP is saved. A check is then made to see whether any spin loop recovery action was taken for this failing CP within the last excessive spin loop interval. If so, subsequent recovery processing is bypassed. In tightly-coupled MP systems of three or more CPs, this is done because two different CPs could enter ESLs against the same failing CP within the same interval. When the first of these two ESLs results in a recovery action, the second ESL must be prevented from initiating another (more disruptive) action before the first one has a chance to complete.

The Excessive Spin Loop Recovery Processor (ESLR) maintains a table in global storage with the time of the last ESL recovery action taken against each CP. This Last Action Taken (LAT) Table (25) has one entry per CP. ESLR then compares the clock value on entry with the LAT value for the failing CP. If an ESL interval has not elapsed since the last action against this failing CP, no action is taken. However, the last detection time (LASTDT (28)) field is updated because this detection must be recorded to ensure the proper determination of a persistent problem. The clock value is again obtained and then stored in the global ESL field (28), indicating that this detection is treated as a global detection, and the routine returns to its caller.
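
The pacing check can be sketched as follows. This is an illustration only: EslState and bypass_for_pacing are hypothetical names, times are plain integers in arbitrary clock units, and the entry time is reused for LASTDT rather than reading the clock a second time.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class EslState:
    esli: int                                          # ESL interval, in clock units
    lat: Dict[int, int] = field(default_factory=dict)  # CP id -> time of last action (LAT)
    lastdt: Optional[int] = None                       # time of last detection (LASTDT)


def bypass_for_pacing(state: EslState, failing_cp: int, now: int) -> bool:
    """True if an action was taken against failing_cp within the last ESL interval."""
    last_action = state.lat.get(failing_cp)
    if last_action is not None and now - last_action < state.esli:
        state.lastdt = now   # the detection is still recorded globally
        return True
    return False
```

With an interval of 100 units and a last action at 101, a detection at 130 is bypassed, while one at 201 (a full interval later) proceeds.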

If no action was taken for this CP within the last ESL interval, a check is made to see if an ESL was detected against any CP within the last two ESL intervals (23).

The question here is whether two consecutive ESL occurrences represent repeated manifestations of the same problem (i.e., a persistent problem) or whether each ESL occurrence represents a separate problem. If an ESL is identified as occurring for a persistent problem, the recovery action for that ESL will be the next one in the series of increasingly severe actions for that particular failing CP.

If an ESL is determined to be the initial manifestation of a problem, all the ESL indicators for all CPs are reset so that any sequence of actions for any CP starts at the first action.

The Excessive Spin Loop Recovery Processor (ESLR) maintains a field (LASTDT) (28) in global storage showing the time of the last detection of an ESL against ANY CP.

A persistent problem exists if:

T - LASTDT < 2 x ESLI

where:
T = time of this entry to the ESL Recovery routine
ESLI = excessive spin loop interval.

When processing of this ESL is complete, LASTDT is updated with the current time at exit from the ESLR process.
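
The persistence test above, together with the reset of the per-CP indicators for an initial manifestation, can be written out as a short Python sketch. The function names are hypothetical and times are integers in arbitrary clock units.

```python
from typing import Dict, Optional


def is_persistent(now: int, lastdt: Optional[int], esli: int) -> bool:
    """T - LASTDT < 2 x ESLI; no earlier detection means a new problem."""
    return lastdt is not None and now - lastdt < 2 * esli


def indexes_for_detection(index: Dict[int, int], now: int,
                          lastdt: Optional[int], esli: int) -> Dict[int, int]:
    """Keep the per-CP escalation indexes for a persistent problem;
    otherwise reset them so every sequence starts at the first action."""
    return dict(index) if is_persistent(now, lastdt, esli) else {}
```

For example, with ESLI = 100, a detection at time 321 following one at 303 is persistent, while a detection at 600 is treated as a new problem and the indexes are cleared.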

Given that the time between entries to ESLR from a given spin routine is equal to ESLI plus a very small delta consisting of linkage time from the spin routine to the ESLR process, it follows that the spin routine will continue to call ESLR in less than two spin loop time-out intervals until it has obtained its required resource. However, a given invocation of ESLR may be locked out if another CP has already serialized the ESLR function. Therefore, ESLR must be cognizant of all entries to ESLR from any CP. If no entry to ESLR occurs from any CP for two or more spin loop time-out intervals, then it follows that ALL spinning routines obtained ALL their desired resources subsequent to the last call to ESLR.
The next check is a determination whether the failing CP is in fact executing a routine that is exempted from excessive spin loop recovery processing (indicated in the LCCA block (27)). A mechanism for providing such an exemption is required because there are legitimate system routines which could otherwise trigger ESL conditions because the time to complete the function exceeds the ESL time-out value. This allows the system routines to set an indicator around the lengthy function in a field checked by the ESL recovery process. This exemption mechanism allows the ESL interval to be reduced far below its value in previous MVS systems of 40 seconds to significantly improve ESL recovery performance. It eliminates the need to spin for such long periods to avoid an ESL detection and recovery action for a legitimate, temporary condition. Some MVS functions included in this validly exempted category are those which load restartable CP wait states for operator communication, place a CP temporarily in a stopped state, or communicate with the operator via the disabled console communication facility.
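
Setting an indicator around a lengthy function can be illustrated with a Python context manager. This is a sketch under assumptions: the dictionary exempt_flags stands in for the per-CP indicator, and esl_exempt and recovery_allowed are hypothetical names.

```python
from contextlib import contextmanager


@contextmanager
def esl_exempt(exempt_flags, cp):
    """Mark `cp` as exempt for the duration of a lengthy, legitimate function."""
    exempt_flags[cp] = True
    try:
        yield
    finally:
        exempt_flags[cp] = False   # cleared even if the function fails


def recovery_allowed(exempt_flags, failing_cp):
    """Recovery actions are taken only if the failing CP is not exempt."""
    return not exempt_flags.get(failing_cp, False)
```

A routine on CP 3 would wrap its long-running work in `with esl_exempt(flags, 3):`, and the recovery process would consult recovery_allowed(flags, 3) before acting.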

If the failing CP is not executing an exempt routine, recovery action is initiated for that failing CP. This recovery action processing is further described in Figure 3. Having taken the appropriate recovery action, the current clock value is placed in the LAT field (26) of the failing CP and the global ESL field (LASTDT (28)) and return is made to the caller.

Referring to Figure 3, on entry to recovery action processing an index associated with the failing CP is incremented. A check is then made against the value of the index. If the value equals 1, a return is made to the caller. This results in a continuation of spinning on the desired lock for another ESL interval. It is important to wait for this additional ESL interval since it is possible that a call may have been made to excessive spin loop recovery processing in the window of time between the clearing of the exemption flag and the enabling of the associated CP, and in this case no disruptive recovery action is desired.

If the index is equal to 2, an indicator is set in the CVT control block indicating ABEND as the recovery action. A Signal Processor instruction indicating restart is then issued to the failing CP to give control to the restart FLIH. Return is then made to the caller. On the failing CP the RESTART FLIH checks the CVT indicator, sets a flag indicating the ABEND action, and passes control to the Recovery Termination Manager to execute the ABEND action, which allows the recovery routines to retry after performing any necessary clean up.

If the index is equal to 3, the CVT flag is set to indicate the TERMINATE recovery option. A Signal Processor instruction indicating restart is then issued [...]
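
The index-driven hierarchy can be summarized in a short Python sketch. The mapping of index 4 to Alternate CP Recovery follows the order of actions listed in the abstract and claims; holding the index at ACR once it is reached is an assumption of this sketch, and the names Action and escalate are hypothetical.

```python
from enum import Enum


class Action(Enum):
    SPIN = 1        # return to caller; spin for another ESL interval
    ABEND = 2       # terminate the routine, recovery routines may retry
    TERMINATE = 3   # terminate the routine, no retry allowed
    ACR = 4         # Alternate CP Recovery: remove the CP from the configuration


def escalate(index_by_cp, failing_cp):
    """Increment the index for failing_cp and return the action it selects."""
    index = min(index_by_cp.get(failing_cp, 0) + 1, Action.ACR.value)
    index_by_cp[failing_cp] = index
    return Action(index)
```

Because the index is kept per failing CP, escalation against one CP does not advance the sequence for any other CP.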
6-WAY EXAMPLE

Figure 4 illustrates Excessive Spin Loop Recovery processing active in a 6-way MP system, with two independent excessive spin loops: the first involves CPs 0, 1 and 2 all waiting for a resource held by failing CP 3; the second involves CP 4 waiting for a resource held by failing CP 5. The example shows:

1. Simultaneous resolution of independent ESLs.

2. Correct progression through the hierarchy of recovery actions for each ESL, taking increasingly severe action when the previous action failed to resolve the problem.

3. Pacing of actions taken for related ESLs (multiple CPs spinning on the same failing CP).

At times T, T+2, and T+3, the waiting CPs (0, 1, 2 and 4) request the needed resource of CP 3 or 5. At T+10, CP 0, noticing that an ESL interval (here, 10 seconds) has elapsed without obtaining the resource, calls ESLR processing, which sets the CP 3 index to 1 and saves the time of this ESLR processing (T+10.1) in the LAT field for CP 3 (figure 2B at 26), and LASTDT (28), and then returns to the caller, who continues to spin (as indicated in figure 3, since this is the initial detection).

At T+12, CP 4 detects an ESL and calls ESLR, which sets the CP 5 index to 1 and saves the time (T+12.1) in the LAT entry for CP 5 (26) and LASTDT (28), and then continues to spin (fig. 3). Simultaneously at T+12, CP 1 detected an ESL and invoked ESLR, which immediately returned since ESLR was already active on CP 4 (see fig. 2A at 21). At T+13, CP 2 detected its ESL and called ESLR, which takes no recovery action since one was taken for this failing CP (CP 3) within the last ESL interval (see fig. 2A at 22). The time (T+13.1) is saved in LASTDT (28).

At T+20.1, another ESL interval having passed for CP 0, ESLR is again invoked; since no action was taken for failing CP 3 within the last ESL interval (T+10.1 to T+20.1) (see fig. 2A at 22), a recovery action is taken: the index for CP 3 is incremented to 2 (fig. 3 at 31), and the ABEND is signalled to CP 3 (32). The time (T+20.2) is saved in LAT for CP 3 (26), and LASTDT (28).

EP 0 351 536 A2

At T+22, CP 1 again detects the expiration of another ESL interval and calls ESLR, which takes no action since action was taken for CP 3 within the past ESL interval (fig. 2A at 22). The time (T+22.1) is saved in LASTDT (28). Also at T+22.1, CP 4 detects the expiration of an ESL interval and calls ESLR, which immediately returns since ESLR is already running on CP 1 (fig. 2A at 21). At T+23.1, CP 2 notes the passing of an ESL interval and calls ESLR, which takes no action since action was taken for CP 3 within the last ESL interval (fig. 2A at 22). The time (T+23.2) is saved in LASTDT (28).

At time T+30.2, CP 0 detects the passage of another ESL interval (the ABEND signalled to CP 3 at T+20.2 has not resolved the problem on CP 3) and calls ESLR, which, since no action was taken for CP 3 within the last ESL interval, increments CP 3's index (fig. 3 at 31) to 3, then signals "Terminate" to CP 3 (33). Time (T+30.3) is saved in the LAT entry for CP 3 (26) and in LASTDT (28). Note that in this example, the Terminate action against the unit of work on CP 3 resolves the spin loop on CPs 0, 1 and 2.

At T+32.1, CP 4, detecting the expiration of another ESL interval (T+22.1 to T+32.1), calls ESLR. ESLR, realizing that no action was taken for CP 5 within the last ESL interval (T+22.1 to T+32.1; LAT for CP 5 is T+12.1), but that there was an ESL detected against some CP within the last two ESL intervals (fig. 2A at 23), increments the index associated with CP 5 to 2 (fig. 3 at 31) and signals ABEND to CP 5 (32). The time (T+32.2) is saved in LAT for CP 5 (26), and in LASTDT (28). In the example, the ABEND action against the unit of work on CP 5 resolves the spin loop on CP 4.
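
The timeline above can be replayed with a small Python model of the decision logic. This is a sketch under stated assumptions, not the patented implementation: times are integer tenths of a second after T, the 0.1 second between entry and the saved clock value is modeled as a constant OVERHEAD, concurrent activity of ESLR on another CP is supplied as a flag on each event, an exempt target leaves LAT and LASTDT untouched, and the names Eslr, detect, EVENTS and replay are hypothetical.

```python
ESLI = 100      # ESL interval: 10 seconds, in tenths of a second
OVERHEAD = 1    # assumed 0.1 s between entry to ESLR and the saved clock value
ACTIONS = {1: "SPIN", 2: "ABEND", 3: "TERMINATE", 4: "ACR"}


class Eslr:
    def __init__(self):
        self.lat = {}        # failing CP -> time of last action taken (LAT)
        self.index = {}      # failing CP -> position in the hierarchy
        self.lastdt = None   # time of last detection against any CP (LASTDT)

    def detect(self, now, failing_cp, busy=False, exempt=False):
        if busy:                                   # ESLR active on another CP
            return "RETURN"
        last = self.lat.get(failing_cp)
        if last is not None and now - last < ESLI:
            self.lastdt = now + OVERHEAD           # pacing: record detection only
            return "NONE"
        persistent = self.lastdt is not None and now - self.lastdt < 2 * ESLI
        if not persistent:
            self.index.clear()                     # new problem: start over
        if exempt:                                 # assumption: no state update
            return "NONE"
        step = min(self.index.get(failing_cp, 0) + 1, 4)
        self.index[failing_cp] = step
        self.lat[failing_cp] = self.lastdt = now + OVERHEAD
        return ACTIONS[step]


EVENTS = [  # (time, detecting CP, failing CP, ESLR busy on another CP)
    (100, 0, 3, False), (120, 4, 5, False), (120, 1, 3, True),
    (130, 2, 3, False), (201, 0, 3, False), (220, 1, 3, False),
    (221, 4, 5, True), (231, 2, 3, False), (302, 0, 3, False),
    (321, 4, 5, False),
]


def replay():
    eslr = Eslr()
    actions = [eslr.detect(t, failing, busy) for t, _cp, failing, busy in EVENTS]
    return actions, eslr
```

Under these assumptions the replay yields SPIN for each first detection, ABEND and then TERMINATE against CP 3, and ABEND against CP 5, with the intervening detections either paced out or returned immediately, in the same order as the narrative.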

Claims 

1. In a multiprocessing system complex comprising at least two processors, an operating system, and resources shared among processors, a method for recognition of and recovery from excessive spin loops by the operating system comprising:

A) detecting, by a detecting routine in a first processor, that said first processor has been in a spin loop requiring a resource held by a resource-holding routine in another processor for a fixed time period;

B) identifying a target processor in said system complex as a target for responsive recovery action;

C) performing no responsive recovery actions if a bypass indicator set by a routine in said second processor so indicates;

D) automatically performing for said target processor one of a hierarchical sequence of responsive programmed recovery actions if said bypass indicator is off;

E) continuing to identify said target and to perform subsequent hierarchical recovery actions for said target processor until said target processor is no longer so identified as said target;

F) continuing to so detect the holding of any of said resources for said fixed time period and to identify target processors and perform target processor-specific hierarchical recovery actions until all of said resources are acquired by all detecting processors.

2. The method of claim 1 in which a subsequent one of said recovery actions in said hierarchical sequence is performed for said target processor only if an immediately preceding one of said hierarchical recovery actions has been performed for said target processor longer ago than one of said fixed time periods.

3. The method of claim 2 in which said subsequent action in said hierarchical sequence is performed if there has been said detecting of one of said spin loops requiring one of said resources held by any of said processors in said multiprocessing complex within two of said fixed time periods, and in which an initial one of said hierarchical actions is performed otherwise.

4. The method of claim 3 in which said hierarchical sequence comprises the action of abnormally terminating said routine in said target processor in a manner that permits a resource-holding routine in said target processor to resume normal execution after cleanup.

5. The method of claim 3 in which said hierarchical sequence comprises the actions of:

A) continuing to wait for said resource to be released for a second fixed time period;

B) abnormally terminating a resource-holding routine in said target processor in a manner that permits said routine in said target processor to resume normal execution after cleanup;

C) terminating said resource-holding routine in said target processor in a manner that does not permit said routine in said target processor to resume normal execution;

D) removing said target processor from said multiprocessor system complex.

6. The method of claim 3 in which said hierarchical sequence comprises the following actions, in the order listed:

A) continuing to wait for said resource to be released for a second fixed time period;

B) abnormally terminating said resource-holding routine in said target processor in a manner that permits said routine in said target processor to resume normal execution after cleanup;

C) terminating said resource-holding routine in said target processor in a manner that does not permit said routine in said target processor to resume normal execution;

D) removing said target processor from said multiprocessor system complex.

7. In a multiprocessing system complex comprising at least two processors, an operating system, and resources shared among processors, a mechanism for recognition of and recovery from excessive spin loops by the operating system comprising:

A) detection means for detecting that a first processor has been in a spin loop requiring a resource held by a routine in a second processor for a fixed time period;

B) identification means for identifying a target processor in said system complex as a target for responsive recovery action when said detecting means detects said spin loop;

C) a processor-bypass indicator associated with each of said processors and having an "on" setting and an "off" setting, said bypass indicator being set to said "on" setting when an exempt routine is executing in said processor associated with said "on" bypass indicator;

D) responsive recovery means for freeing said resource held by said target processor only if said processor-bypass indicator associated with said target processor is "off".

8. The mechanism of claim 7 in which said responsive recovery means comprises a hierarchical set of recovery functions, which further comprise an ABEND-triggering function for causing a resource-holding routine executing in said target processor to abnormally terminate, allowing retry.

9. The mechanism of claim 7 in which said responsive recovery means comprises a hierarchical set of recovery functions, said functions comprising:

A) a spin function for permitting said first processor to remain in said spin loop for a second fixed time period;

B) an ABEND-triggering function for causing a resource-holding routine executing in said target processor to abnormally terminate, allowing retry;

C) a TERMINATE-triggering function for causing a resource-holding routine executing in said target processor to terminate without retry;

D) an ACR function for removing said target processor from said multiprocessor system complex.

10. The mechanism of claim 9 further comprising means for causing successive detections of said spin loop fixed time periods resulting in identification of the same target processor or a different target processor to cause invocation of one of said recovery functions, said [...] invoked in the order [...] target processor, if one of said functions [...] invoked less recently than said fixed time [...] said identified target processor.

11. The mechanism of claim 10 further comprising means for causing a successive [...] of said spin loop fixed time period [...] detection to invoke a sequential recovery [...] for said identified target processor when [...] successive detection occurs within 2 fixed time intervals of said prior detection.